Document matching on CCITT Group 4 compressed images
Abstract
A method is proposed for detecting whether two CCITT Group 4 images were scanned from the same document. Features are extracted from rectangular patches of text and compared with a modified Hausdorff distance measure. Two images are said to be "equivalent" (i.e., they were scanned from the same document) if the Hausdorff measure finds that a specified number of features are located within a given distance of one another in both images. This paper explains the technique and presents experimental results that demonstrate its effectiveness. It is shown that features extracted from a one-inch square patch of image data provide better than 95% correct retrieval accuracy with no false positives on a database of 800 documents.
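The equivalence test described in the abstract can be illustrated with a minimal sketch. It assumes that each feature reduces to a 2-D point (x, y) and that distance is Euclidean; the abstract specifies neither, nor the exact form of the modified Hausdorff measure. The names `count_matched`, `is_equivalent`, `max_dist`, and `min_matches` are hypothetical and chosen only for illustration.

```python
from math import hypot


def count_matched(features_a, features_b, max_dist):
    """Count points in features_a that have at least one point
    of features_b within max_dist (Euclidean distance)."""
    return sum(
        1
        for (xa, ya) in features_a
        if any(hypot(xa - xb, ya - yb) <= max_dist for (xb, yb) in features_b)
    )


def is_equivalent(features_a, features_b, max_dist, min_matches):
    """Assumed decision rule: the images are 'equivalent' when at least
    min_matches features find a close partner in both directions."""
    forward = count_matched(features_a, features_b, max_dist)
    backward = count_matched(features_b, features_a, max_dist)
    return min(forward, backward) >= min_matches


# Hypothetical feature locations from two scanned patches.
patch_a = [(0, 0), (10, 10), (20, 5)]
patch_b = [(1, 0), (10, 12), (50, 50)]
print(is_equivalent(patch_a, patch_b, max_dist=2.5, min_matches=2))
```

Counting matches in both directions is one way to read "in both images"; a one-directional count would be a looser test and is an equally plausible reading of the abstract.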
Similar resources
Word Searching in CCITT Group 4 Compressed Document Images
In this paper, we present a compressed pattern matching method for searching user queried words in the CCITT Group 4 compressed document images, without decompressing. The feature pixels composed of black changing elements and white changing elements are extracted directly from the CCITT Group 4 compressed document images. The connected components are labeled based on a line-by-line strategy ac...
Keyword Searching in Compressed Document Images
A huge number of document images are accessible on the Internet and in digital libraries. We find that most of them are packed in PDF files and compressed using the CCITT Group 4 standard to save storage space and speed up transmission. There is thus significant value in developing methods for directly searching keywords in these documents. In this paper, we present a compressed patter...
Similarity measure for CCITT Group 4 compressed document images
Similarity measurement of document images plays a crucial role in the area of document image retrieval. A method of measuring the similarity of CCITT Group 4 compressed document images is proposed in this paper. The features are extracted directly from the changing elements of the compressed images. Weighted Hausdorff distance is utilized to assign all of the word objects from two document images to...
Document retrieval from compressed images
With the emergence of digital libraries, more and more documents are stored and transmitted through the Internet in the format of compressed images. It is of significant meaning to develop a system which is capable of retrieving documents from these compressed document images. Aiming at the popular compression standard, CCITT Group 4, which is widely used for compressing document images, we presen...
Group 4 Compressed Document Matching
Numerous approaches, including textual, structural and featural, for detecting duplicate documents have been investigated. Considering document images are usually stored and transmitted in compressed forms, it is advantageous to perform document matching directly on the compressed data. A two-stage process for matching Group 4 compressed document images is presented. In the coarse matching stag...